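In this project we analyze The Movie Database (TMDb) dataset. It contains information about roughly 10,000 movies collected from The Movie Database, including user ratings and revenue. We will try to answer the following questions:
1) Which production company produced the most movies?
2) Which movies earned the highest and the lowest profits?
3) Which genres are released most often, and which gain popularity over time?
4) Which genre earns the highest profits, and which has the longest average duration?
5) Which variables are associated with profits?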
#Import packages
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('darkgrid')
# load the movies Data
movies_df = pd.read_csv('tmdb-movies.csv')
movies_df.head(5)
movies_df.describe()
movies_df.shape
movies_df.info()
Based on the output of movies_df.info(), several columns are not needed for our analysis: imdb_id, cast, homepage, tagline, keywords, and overview.
In this section we drop those columns, count and remove duplicate rows, and drop the rows that contain null values. We then convert release_date to a datetime type and budget_adj and revenue_adj to integers.
# Drop the columns mentioned above
movies_df.drop(["imdb_id","cast","homepage","tagline","keywords","overview"], axis=1, inplace = True)
movies_df.head()
# Checking Duplicates in the data
sum(movies_df.duplicated())
#Removing the duplicated values
movies_df.drop_duplicates(inplace= True)
sum(movies_df.duplicated())
# Checking Null Values in the data
movies_df.isna().sum()
# Remove null values from the data
movies_df=movies_df.dropna(how='any')
movies_df.isna().sum()
movies_df.shape
# Converting release_date (object dtype) to datetime
movies_df['release_date']=pd.to_datetime(movies_df['release_date'])
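One caveat is worth checking at this step. If release_date stores two-digit years (an assumption about the CSV, for example `11/15/66`), the parser maps years 00 to 68 into the 2000s, so a 1966 movie can end up dated 2066. The sketch below uses made-up rows and rebuilds the date from the release_year column; the helper name `align_release_dates` is hypothetical.

```python
import pandas as pd

def align_release_dates(df):
    """Rebuild release_date so that its year agrees with release_year."""
    # Assumption: dates are written as month/day/two-digit-year
    parsed = pd.to_datetime(df["release_date"], format="%m/%d/%y")
    # Keep the parsed month and day, but trust release_year for the year
    parts = pd.DataFrame({
        "year": df["release_year"],
        "month": parsed.dt.month,
        "day": parsed.dt.day,
    })
    return pd.to_datetime(parts)

# Made-up rows for illustration only
toy = pd.DataFrame({
    "release_date": ["11/15/66", "6/9/15"],
    "release_year": [1966, 2015],
})
fixed = align_release_dates(toy)
print(fixed.dt.year.tolist())
```

If the years already agree, the function returns the same dates, so it should be safe to apply to every row.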
# Converting budget_adj and revenue_adj to the int data type
movies_df['budget_adj']=movies_df['budget_adj'].astype(int)
movies_df['revenue_adj']=movies_df['revenue_adj'].astype(int)
# Checking that the data types changed
movies_df.info()
movies_df.hist(figsize=(15,10));
pd.plotting.scatter_matrix(movies_df, figsize=(15,12));
movies_df['release_year'].value_counts().plot(kind='bar',figsize=(15,12));
### Checking outliers using box plots
con_var =['popularity','budget','revenue','runtime','vote_count',
'vote_average','budget_adj','release_year','revenue_adj']
movies_df[con_var].boxplot(return_type = 'axes',figsize =(20,12))
Here we can see outliers in the variables revenue and revenue_adj. Since revenue depends on a movie's popularity, very high values are plausible rather than errors, so we keep the outliers in this analysis.
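To put a number on what the box plots show, one option is the usual 1.5 × IQR whisker rule. This is a minimal sketch on made-up values; `iqr_outlier_mask` is a hypothetical helper, not part of the notebook.

```python
import pandas as pd

def iqr_outlier_mask(series, k=1.5):
    """Return True for values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

# Made-up revenue-like values with one extreme entry
toy = pd.Series([1, 2, 3, 4, 5, 100])
mask = iqr_outlier_mask(toy)
print(int(mask.sum()), "outlier(s) found")
```

Applied to a column such as movies_df['revenue'], the mask gives a count of flagged rows without removing them.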
From the release-year bar chart above, the largest number of movies was released in 2014.
movies_df['production_companies'].value_counts()
Paramount Pictures is the production company entry that appears most often, so it has produced the largest number of movies.
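Note that value_counts() treats each cell as one label. If production_companies lists several companies in one cell separated by "|" (an assumption, mirroring how the genres column is split later), co-productions are counted as separate labels. A minimal sketch of counting each company individually, on made-up values with a hypothetical helper `count_companies`:

```python
import pandas as pd

def count_companies(column, sep="|"):
    """Count how often each company appears across all rows."""
    return (
        column.dropna()
        .str.split(sep)      # one list of companies per movie
        .explode()           # one row per (movie, company) pair
        .str.strip()
        .value_counts()
    )

# Made-up values for illustration only
toy = pd.Series(["A|B", "A", "B|C|A", None])
counts = count_companies(toy)
print(counts)
```

Comparing this result with the plain value_counts() output shows whether the ranking changes once co-productions are split.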
movies_df['profits'] = movies_df['revenue']-movies_df['budget']
movies_df.head(2)
df_p = movies_df.loc[:,["production_companies","budget", "original_title", "revenue","profits"]]
df_p.head()
idx= df_p['profits'].idxmax()
idx
df_p.loc[idx]
ix = df_p['profits'].idxmin()
ix
df_p.loc[ix]
"Avatar" is the movie with the highest profits, and "The Warrior's Way" is the movie with the lowest profits.
# splitting the genres data, which is separated by "|", into one row per genre
genres_df = movies_df
genres_df = genres_df.drop('genres', axis=1).join(genres_df['genres'].str.split('|', expand=True).stack().reset_index(level=1, drop=True).rename('genres'))
# resetting the index of the dataset
genres_df.reset_index(inplace = True)
genres_df.head()
# creating bin edges on release_year to group the movies by decade
bin_edges = [1960, 1970, 1980, 1990, 2000, 2010, 2020]
bin_labels = ['1960', '1970', '1980', '1990', '2000', '2010']
# right=False makes each bin [start, end), so 1960 itself lands in the '1960' bin
genres_df['release_decade'] = pd.cut(genres_df['release_year'], bin_edges, labels=bin_labels, right=False)
genres_df.head()
genres_df.info()
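A quick way to sanity-check decade labels is to compare pd.cut against plain integer arithmetic. With right=False each bin is half-open, [start, end), which should agree with year // 10 * 10. The values below are made up for illustration.

```python
import pandas as pd

years = pd.Series([1960, 1969, 1970, 2015])
edges = [1960, 1970, 1980, 1990, 2000, 2010, 2020]
labels = [1960, 1970, 1980, 1990, 2000, 2010]

# Half-open bins: 1960 <= year < 1970 maps to 1960, and so on
by_cut = pd.cut(years, edges, labels=labels, right=False).astype(int)
# Integer arithmetic: drop the last digit of the year
by_floor = years // 10 * 10

print(by_cut.tolist())
print(by_floor.tolist())
```

With the default right=True, the boundary years would shift: 1960 would fall outside every bin and 1970 would be labelled 1960.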
plt.figure(figsize=(10,8));
sns.countplot(x=genres_df["genres"]);
plt.xticks(rotation=90);
plt.title("Count of different Genres from 1960 to 2010")
Of all genres, Drama has the highest number of releases over the whole period.
decades = ["1960", "1970", "1980", "1990", "2000", "2010"]
for i in decades:
    plt.figure(figsize=(11, 8))
    # keep only the rows that belong to the current decade
    df = genres_df[genres_df["release_decade"] == i]
    sns.countplot(x=df["genres"])
    plt.xticks(rotation=90)
    plt.title("Count of movies released per genre in the " + i + "s")
Across the decades, Drama movies are released most often, followed by Comedy.
# selecting the release decade, genre and popularity mean
df_dec=genres_df.groupby(['release_decade','genres'], as_index = False)['popularity'].mean()
df_dec
# plotting the mean popularity of each genre over the decades
labels = ["1960", "1970", "1980", "1990", "2000", "2010"]
for i in genres_df["genres"].value_counts().index:
    data = df_dec[df_dec["genres"] == i]
    plt.plot(data["release_decade"], data["popularity"])
    plt.xticks(labels)
    plt.title("Genre Type " + i)
    plt.xlabel('release decade')
    plt.ylabel('popularity')
    plt.show()
Over time, most genres become more popular; the exceptions are a few genres such as Foreign and TV Movie. The popularity of all remaining genres increases gradually from decade to decade.
df_genres = df_dec.pivot(index="release_decade", columns="genres", values="popularity")
df_genres.plot(kind='bar',width=0.8,figsize=(17,8), edgecolor = "black")
plt.legend(bbox_to_anchor=(1.2,1), loc='upper right', ncol=1,title="Genres")
plt.title("Popularity rating of movies by genres over decades")
plt.xlabel("decade")
plt.ylabel("Popularity rating")
plt.show()
As the graph above shows, genres like Action, Adventure, and Comedy gradually gain popularity every decade. Adventure has a high popularity rating in all decades, and Foreign has the lowest popularity rating in all decades.
df = genres_df[['original_title','profits','genres']]
df.head()
df1 = df.groupby('genres')[['profits']].mean()
df1['profits_million'] = df1['profits']/1000000
del df1['profits']
df1.sort_values('profits_million', ascending=False, inplace = True )
df1[['profits_million']].plot.barh(stacked=True, title = 'Genres by profit (US$ million)', figsize=(12, 8));
The Adventure genre has the highest average profit, followed by Fantasy, Animation, and Family.
genres_df.head()
# selecting the columns required for the analysis
df_1 = genres_df[['original_title','genres','runtime']]
df_1
# calculating the mean runtime per genre using groupby
df1 = df_1.groupby('genres')[['runtime']].mean()
df1['Average_duration'] = df1['runtime'].round(2)
del df1['runtime']
df1.sort_values('Average_duration', ascending=False, inplace = True )
df1[['Average_duration']].plot.barh(stacked=True, title = 'Average_duration by Genre', figsize=(12, 8));
The History genre has the highest average duration, followed by War; Animation has the lowest average duration.
# heat map of the correlations between the numeric columns
sns.heatmap(movies_df.select_dtypes(include='number').corr(), annot=True, linewidths=.5, fmt='.1f');
plt.title("Correlation Heat map");
# helper that draws a scatter plot of revenue against one numeric column
def scat(data):
    x = movies_df["revenue"]
    y = movies_df[data]
    plt.scatter(x, y)
    plt.title("revenue vs " + data)
    plt.xlabel("revenue")
    plt.ylabel(data)
    plt.show()

# every numeric column except revenue itself
columns = movies_df.drop(columns=["revenue"]).select_dtypes(include="number").columns
for i in columns:
    scat(i)
Above, we plotted revenue against each of the remaining numeric columns. Budget, popularity, runtime, vote_count, and profits are positively correlated with revenue. Because profits are computed directly from revenue, these variables are also associated with profitable movies.
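Scatter plots show direction but not strength, so it can help to list the Pearson coefficients alongside them. The sketch below is one way to do that on made-up numbers; `revenue_correlations` is a hypothetical helper and the toy columns are illustrative only.

```python
import pandas as pd

def revenue_correlations(df, target="revenue"):
    """Pearson correlation of every numeric column with the target column."""
    numeric = df.select_dtypes(include="number")
    # Drop the target's correlation with itself and sort strongest first
    return numeric.corr()[target].drop(target).sort_values(ascending=False)

# Made-up data: budget rises with revenue, runtime mostly falls
toy = pd.DataFrame({
    "revenue": [10, 20, 30, 40],
    "budget": [5, 10, 15, 20],
    "runtime": [120, 100, 110, 90],
    "title": ["a", "b", "c", "d"],
})
result = revenue_correlations(toy)
print(result)
```

Calling revenue_correlations(movies_df) would give one coefficient per numeric column, which makes it easier to say which relationships are strong and which are only weakly positive.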
We are done with the data analysis of the TMDb movies data and have found answers to the questions we posed:
1) Paramount Pictures produced the largest number of movies.
2) "Avatar" earned the highest profits and "The Warrior's Way" earned the lowest profits.
3) Drama is the most frequently released genre over the whole period. Adventure and Action gain popularity over time.
4) The Adventure genre earns the highest average profits, and History movies have the longest average duration.
5) We have found the variables associated with profits.
1) Missing values were present in the data set; we removed the affected rows.
2) Duplicate rows were present; we removed them.
3) The budget and revenue columns do not state a currency unit.